Hello Oeda Platform users! Oeda is an interactive platform for analyzing Brazilian e-commerce trends. By creating a user-friendly, intuitive tool for e-commerce, we aim to make it easier for merchants to understand and analyze the market and to make data-driven decisions that improve their sales and meet consumer needs. This part of the project cleans the Brazilian e-commerce data and prepares it for analytics.
Data cleaning is the process of identifying and correcting errors, inconsistencies, and missing data in a dataset. It is a critical step in data preparation, ensuring that the data is correct, consistent, and suitable for analysis. The process typically involves several steps. First, the data is examined to identify errors or inconsistencies, either by visual inspection or by using algorithms and statistical techniques to detect abnormalities. Once these problems have been identified, they must be corrected or removed, which may mean manually fixing individual records or using algorithms to correct or impute missing data automatically. Finally, the cleaned datasets are validated to ensure they are correct and ready for analysis.
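The three stages above (examine, correct, validate) can be sketched with pandas on a toy data frame; the column names and values here are hypothetical, purely for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical toy dataset, purely for illustration.
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "price": [10.0, np.nan, np.nan, -5.0],  # missing and invalid values
})

# 1. Examine: detect missing or impossible values.
n_missing = raw["price"].isna().sum()

# 2. Correct: drop duplicate orders, remove invalid rows, impute the rest.
cleaned = raw.drop_duplicates(subset="order_id")
cleaned = cleaned[cleaned["price"].fillna(0) >= 0].copy()
cleaned["price"] = cleaned["price"].fillna(cleaned["price"].mean())

# 3. Validate: assert the invariants the analysis relies on.
assert cleaned["price"].notna().all()
assert (cleaned["price"] >= 0).all()
```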
At the very first step, we set up the packages needed for the data cleaning process by running the following commands to install the pandas and pandasql packages with pip.
This downloads and installs the pandas and pandasql packages along with any necessary dependencies. Once installed, we can use them in our Python programs by importing them in the next step.
The pandas package provides tools for working with tabular data in Python, including functions for reading, writing, and manipulating data in various formats. The pandasql package lets us run SQL queries against pandas data frames. Together, these packages can be used to perform data analysis and manipulation tasks involving both tabular data and SQL-style queries.
!pip install pandas==1.1.5
!pip install pandasql
In this part, we import various packages that will be useful for our data import process. We mainly use pandas to process the data. These packages will be beneficial for future data-cleaning strategies.
There are many advantages to importing packages like numpy, datetime, random, and sklearn in our Python programs. Some of these advantages include the following:
Functionality: These packages provide a wide range of functions and tools for tasks such as working with arrays, dates and times, random numbers, and machine learning algorithms, letting us write more powerful and flexible programs.
Efficiency: The functions these packages provide are often highly optimized, so they run faster than the same functionality implemented from scratch, saving time and resources.
Consistency: Using these packages keeps our code consistent with established best practices and standards, making it easier to read and understand and helping prevent errors and bugs.
Community: These packages are widely used and well supported, so a large community of users and developers can help if we run into questions or issues, giving us access to knowledge and resources.
Overall, importing packages like numpy, datetime, random, and sklearn can help us write more powerful, efficient, and consistent Python programs.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import re
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter
import random
from random import sample
import sklearn
We found two datasets that are useful for the project. One of them is the Brazilian E-commerce Dataset, which has detailed information on the orders made at the Olist store from 2016 to 2018. Here we use pandas to load the CSV files into data frames, which we can then use in the cleaning steps that follow.
To import data from Google Drive using the read_csv method in the pandas package, we first ensure that pandas is installed and that our notebook is authorized to access our Google Drive account.
Then we can use the read_csv method to read a CSV file from our Google Drive account into a pandas data frame. We use this method to load each of the datasets we want to clean.
We can then use the various methods and functions of the pandas package to manipulate and analyze the data.
Note: cleaned data
from google.colab import drive
drive.mount('/content/drive')
items_data = pd.read_csv('/content/drive/My Drive/olist_order_items_dataset.csv')
product_data = pd.read_csv('/content/drive/My Drive/olist_products_dataset.csv')
city_data = pd.read_csv('/content/drive/My Drive/BRAZIL_CITIES.csv')
customer_data = pd.read_csv('/content/drive/My Drive/olist_customers_dataset.csv')
geolocation_data = pd.read_csv('/content/drive/My Drive/olist_geolocation_dataset.csv')
payment_data = pd.read_csv('/content/drive/My Drive/olist_order_payments_dataset.csv')
order_data = pd.read_csv('/content/drive/My Drive/olist_orders_dataset.csv')
category_data = pd.read_csv('/content/drive/My Drive/product_category_name_translation.csv')
seller_data = pd.read_csv('/content/drive/My Drive/olist_sellers_dataset.csv')
review_data = pd.read_csv('/content/drive/My Drive/olist_order_reviews_dataset.csv')
# items_data = pd.read_csv('/content/drive/MyDrive/items.csv')
# product_data = pd.read_csv('/content/drive/MyDrive/product.csv')
# city_data = pd.read_csv('/content/drive/MyDrive/city.csv')
# customer_data = pd.read_csv('/content/drive/MyDrive/customer.csv')
# geolocation_data = pd.read_csv('/content/drive/MyDrive/geolocation.csv')
# payment_data = pd.read_csv('/content/drive/MyDrive/payment.csv')
# order_data = pd.read_csv('/content/drive/MyDrive/order.csv')
# category_data = pd.read_csv('/content/drive/MyDrive/product_category_name_translation.csv')
# seller_data = pd.read_csv('/content/drive/MyDrive/seller.csv')
# review_data = pd.read_csv('/content/drive/MyDrive/review.csv')
#pd.set_option('max_columns', None)
Note: original data (before cleaning)
We use read_csv() again here to read the CSV files containing data about items, products, cities, customers, geolocations, payments, orders, categories, sellers, and reviews into pandas data frames, which we can then use to access and manipulate the data in various ways.
items_data = pd.read_csv('olist_order_items_dataset.csv')
product_data = pd.read_csv('olist_products_dataset.csv')
city_data = pd.read_csv('BRAZIL_CITIES.csv')
customer_data = pd.read_csv('olist_customers_dataset.csv')
geolocation_data = pd.read_csv('olist_geolocation_dataset.csv')
payment_data = pd.read_csv('olist_order_payments_dataset.csv')
order_data = pd.read_csv('olist_orders_dataset.csv')
category_data = pd.read_csv('product_category_name_translation.csv')
seller_data = pd.read_csv('olist_sellers_dataset.csv')
review_data = pd.read_csv('olist_order_reviews_dataset.csv')
pd.set_option('display.max_columns', None)
The items dataset in the Brazilian e-commerce dataset contains one row for each item sold on the platform, recording the order ID, product ID, seller ID, item price, and freight value. We find this table very useful because it describes each specific item within an order: from items_data, we can learn about each item's associated order, product, seller, price, and freight value.
The order ID is a unique identifier for each order placed on the platform, the seller ID identifies the seller who sold the item, and the price and freight value are given in Brazilian reais.
This information can be used to understand the characteristics of the items sold on the platform, such as their prices and freight costs and the sellers who sold them. It can also be used to analyze trends and patterns in the sales data, such as which sellers had the most sales or which items were the most popular.
items_data
items_data.describe()
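As a sketch of the trend analysis this table supports, the snippet below ranks sellers and products on a toy data frame (items_toy) that mimics the Olist items schema; the values are made up:

```python
import pandas as pd

# Toy stand-in for items_data with the Olist column names (values made up).
items_toy = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "product_id": ["p1", "p1", "p2", "p3"],
    "seller_id": ["s1", "s1", "s2", "s1"],
    "price": [50.0, 50.0, 30.0, 20.0],
    "freight_value": [8.0, 8.0, 5.0, 4.0],
})

# Sellers ranked by the number of items they sold.
top_sellers = items_toy.groupby("seller_id")["order_id"].count().sort_values(ascending=False)

# Products ranked by how often they were ordered.
top_products = items_toy["product_id"].value_counts()
```

The same two group-bys run unchanged on the real items_data frame.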
The product table in the Brazilian e-commerce dataset contains detailed information about each product sold on the platform. This information includes the category of the product, its volume dimensions, and its weight.
Having this information in a separate table can be helpful for various purposes. For example, we can use it to understand the characteristics of the products sold on the platform, such as their categories, dimensions, and weights. This info can help analyze trends and patterns in the sales data, such as which types of products were the most popular or which products had the most significant volumes.
In addition, having this information in a separate table can make it easier to perform queries and explorations on the sales data. For example, we could use SQL queries to join the product table with other tables, such as the order table, to find out which products were ordered the most or to analyze the sales data by product category. This method can help us gain a more detailed and nuanced understanding of the sales data and support more advanced analyses and insights.
product_data
The City dataset in the Brazilian Cities dataset contains demographic information about Brazilian municipalities, including each municipality's name, state, location, taxes, GDP, and population.
This information in the dataset can help study geographic insights about sales and marketing. For example, we can use this information to understand the demographics of the municipalities where the sales were made, such as their population size, GDP, and taxes. This info can help analyze trends and patterns in the sales data, such as which municipalities had the most sales or the highest GDP.
In addition, having this information in the dataset can make it easier to perform queries and explorations on the sales data. For example, we could use SQL queries to join the City dataset with other tables, such as the order table, to find out which municipalities had the most sales or to analyze the sales data by city or state.
city_data
The customer dataset in the Brazilian e-commerce dataset contains information about the locations of each customer. This information includes the zip code, city, and state of each customer.
This information in the dataset can help study customers' shopping habits and preferences. For example, we can use this information to understand the locations of the customers who made purchases on the platform, such as their zip codes, cities, and states. This method can help analyze trends and patterns in the sales data, such as which locations had the most sales or which customers had the most purchases.
In addition, having this information in the dataset can make it easier to perform queries and explorations on the sales data. For example, we could use SQL queries to join the customer dataset with other tables, such as the order table, to find out which customers made the most purchases or to analyze the sales data by customer location.
customer_data
customer_data.groupby('customer_city')['customer_id'].count().sort_values(ascending=False).head(10)
The geolocation dataset in the Brazilian Cities dataset contains the latitude and longitude coordinates for each zip code. This information can be used to visualize the geographic distribution of customers, sellers, or sales data.
For example, we wanted to create a map showing the locations of customers who made purchases on the platform. In that case, we could use the latitude and longitude coordinates from the geolocation dataset to plot the locations of these customers on the map. This info could help us understand the geographic distribution of the sales data and provide insights into customers' regional shopping habits and preferences.
Similarly, we wanted to create a map showing the locations of sellers who sold items on the platform. In that case, we could use the latitude and longitude coordinates from the geolocation dataset to plot the locations of these sellers on the map. This info could help us understand the sellers' geographic distribution and provide insights into the regional patterns of sales and commerce on the platform.
Overall, the geolocation dataset can be a valuable tool for creating visualizations and gaining insights into the geographic distribution of customers, sellers, and sales data on the Brazilian e-commerce platform.
geolocation_data
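Before plotting, the geolocation table is typically collapsed to one representative coordinate per zip code prefix, since it holds many readings per prefix. A minimal sketch on a toy data frame (geo_toy, with made-up values):

```python
import pandas as pd

# Toy stand-in for geolocation_data: several coordinate readings per
# zip code prefix, with made-up values.
geo_toy = pd.DataFrame({
    "geolocation_zip_code_prefix": [1037, 1037, 4567],
    "geolocation_lat": [-23.54, -23.56, -22.90],
    "geolocation_lng": [-46.63, -46.65, -43.20],
})

# One representative coordinate per prefix; these points can then be fed
# to a scatter plot (longitude on x, latitude on y) to draw the map.
zip_coords = geo_toy.groupby("geolocation_zip_code_prefix")[
    ["geolocation_lat", "geolocation_lng"]
].mean()
```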
The payment dataset in the Brazilian e-commerce dataset contains information about how each order was paid. This information includes the payment type, payment installments, and payment values for each order.
Having this information in the dataset can help study customers' payment habits. For example, we can use it to understand the payment types used on the platform, such as credit card, boleto (the Brazilian bank slip), or voucher. This helps analyze trends and patterns in the sales data, such as which payment types were the most common or which orders had the highest payment values.
In addition, having this information in the dataset can make it easier to perform queries and explorations on the sales data. For example, we could use SQL queries to join the payment dataset with other tables, such as the order table, to find out which orders had the highest payment values or to analyze the sales data by payment type.
payment_data
payment_data.describe()
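As a sketch of the payment-habit analysis described above, assuming the Olist payment columns, we can count each payment type and average its payment value on a toy data frame (payments_toy, with made-up values):

```python
import pandas as pd

# Toy stand-in for payment_data with the Olist column names (values made up).
payments_toy = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "payment_type": ["credit_card", "boleto", "credit_card", "voucher"],
    "payment_installments": [3, 1, 1, 1],
    "payment_value": [120.0, 45.0, 60.0, 15.0],
})

# How often each payment type is used and its average payment value.
payment_summary = payments_toy.groupby("payment_type")["payment_value"].agg(["count", "mean"])
```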
The order dataset in the Brazilian e-commerce dataset contains information about each order placed on the platform. This information includes the customer who placed the order, the time when the order was placed, the time when the order was approved, and the time when the order was delivered.
This information in the dataset can help understand the processing time of each order. For example, we can calculate the time between when an order was placed and when it was delivered, which gives us an idea of how long orders take to be processed and delivered. This helps analyze trends and patterns in the sales data, such as which orders took the longest to be delivered or which customers had the most orders.
In addition, having this information in the dataset can make it easier to perform queries and explorations on the sales data. For example, we could use SQL queries to join the ordered dataset with the customer dataset, allowing us to analyze the sales data by the customer and gain a more detailed and nuanced understanding of the sales data.
order_data
order_data.describe()
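The processing-time calculation described above can be sketched with pd.to_datetime on a toy pair of orders (orders_toy; the timestamps are made up but use the Olist column names):

```python
import pandas as pd

# Toy stand-in for two orders, using the Olist timestamp column names
# (the dates themselves are made up).
orders_toy = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "order_purchase_timestamp": ["2017-10-02 10:56:33", "2018-07-24 20:41:37"],
    "order_delivered_customer_date": ["2017-10-10 21:25:13", "2018-08-07 15:27:45"],
})

purchase = pd.to_datetime(orders_toy["order_purchase_timestamp"])
delivered = pd.to_datetime(orders_toy["order_delivered_customer_date"])

# Whole days from purchase to delivery.
orders_toy["delivery_days"] = (delivered - purchase).dt.days
```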
The category dataset in the Brazilian e-commerce dataset contains information about the categories of products sold on the platform, including each category's Portuguese name and its English translation.
This information can help analyze the top-rated product categories. For example, we can use it to group the products by category and calculate the number of sales or the average rating for each category, which gives us an idea of which products were the most popular or highly rated on the platform.
In addition, having this information in the dataset can make it easier to perform queries and explorations on the sales data. For example, we could use SQL queries to join the category dataset with other tables, such as the order table or the review table, to find out which categories had the most sales or the highest ratings or to analyze the sales data by type.
category_data
The seller dataset in the Brazilian e-commerce dataset contains each seller's unique ID along with the city and state where the seller is located.
This information can help analyze which cities have the most sellers. For example, we can group the sellers by city and count the sellers in each city, which gives us an idea of which cities have the most sellers on the platform and provides insights into the regional patterns of sales and commerce.
In addition, having this information in the dataset can make it easier to perform queries and explorations on the sales data. For example, we could use SQL queries to join the seller dataset with other tables, such as the order table or the delivery table, to find out which sellers had the most sales or the fastest delivery times or to analyze the sales data by seller location.
seller_data
seller_data.groupby('seller_city')['seller_id'].count().sort_values(ascending=False).head(10)
The review dataset in the Brazilian e-commerce dataset contains information about the reviews customers left for their orders, including the rating and any comments for each review.
This information can help find the most highly rated products or categories. For example, after joining reviews with product information, we can group by category and calculate each category's average rating, which gives us an idea of which kinds of products were the most highly rated on the platform and provides insights into customers' preferences and opinions.
In addition, having this information in the dataset can make it easier to perform queries and explorations on the sales data. For example, we could use SQL queries to join the review dataset with other tables, such as the order table or the category table, to find out which products had the most sales or the highest ratings or to analyze the sales data by product category.
review_data
review_data.describe()
After identifying the tables and datasets available in the Brazilian e-commerce dataset and the Brazilian Cities dataset, the next step in the data cleaning process would be to handle any null or missing values in the data.
One way to do this is to use the dropna() method provided by the pandas package, which removes rows that contain null or missing values from a data frame. The resulting data frame has only rows with non-null values in every column. Alternatively, we can use the notnull() method to keep only the rows where a particular column is non-null.
order_items_data = items_data.dropna()
product_data = product_data.dropna()
city_data = city_data[city_data['CITY'].notnull()]
geolocation_data = geolocation_data.dropna()
customer_data = customer_data.dropna()
payment_data = payment_data.dropna()
review_data = review_data.dropna()
order_data = order_data.dropna()
seller_data = seller_data.dropna()
category_data = category_data.dropna()
product_data
city_data
In the orders dataset of the Brazilian e-commerce dataset, we have the purchase, approval, and delivery times and dates for each order. To make it easier to perform queries and analyses on this data, we can use the DatetimeIndex function provided by the pandas package to normalize the purchasing, processing, and delivery times in the order data.
To do this, we first load the order data into a pandas data frame and then use the DatetimeIndex function to build a datetime index from each timestamp column.
The DatetimeIndex function takes a timestamp column from the data frame (for example, order_purchase_timestamp) and builds a datetime index from its values, exposing components such as the year, month, and day. This lets us perform queries and analyses on the order data using those components, making the data easier to handle and manipulate.
For example, once the date parts are extracted, we can select only the rows that fall within a specific date range, group the data by date, and perform calculations such as the total number of orders or the total value of orders within a given date range. These operations help analyze trends and patterns in the sales data and support more advanced analyses and insights.
# Derive year, month, and day columns from each timestamp in the order table.
date_columns = {
    'order_purchase_timestamp': 'order_purchase',
    'order_approved_at': 'order_approve',
    'order_delivered_carrier_date': 'order_deliver_carrier',
    'order_delivered_customer_date': 'order_deliver_customer',
    'order_estimated_delivery_date': 'order_estimate_delivery',
}
for source, prefix in date_columns.items():
    index = pd.DatetimeIndex(order_data[source])
    order_data[f'{prefix}_year'] = index.year
    order_data[f'{prefix}_month'] = index.month
    order_data[f'{prefix}_day'] = index.day
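With the year and month parts extracted, grouping orders by month becomes a one-liner. A minimal sketch on a toy data frame (orders_toy, with made-up dates):

```python
import pandas as pd

# Toy orders with the derived year/month columns built the same way as above.
orders_toy = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "order_purchase_timestamp": ["2017-01-05", "2017-01-20", "2018-03-02"],
})
idx = pd.DatetimeIndex(orders_toy["order_purchase_timestamp"])
orders_toy["order_purchase_year"] = idx.year
orders_toy["order_purchase_month"] = idx.month

# Orders per (year, month): the kind of trend query the new columns enable.
monthly_orders = orders_toy.groupby(
    ["order_purchase_year", "order_purchase_month"]
)["order_id"].count()
```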
Some of the column names may be misspelled. If some column names in the Brazilian e-commerce dataset are misspelled or incorrect, we can use the rename() function provided by the pandas package to correct them. Having correctly spelled and properly formatted column names is helpful for future merge and join operations on the data.
product_data = product_data.rename(columns={'product_name_lenght':'product_name_length'})
product_data = product_data.rename(columns={'product_description_lenght':'product_description_length'})
The City data in the Brazilian Cities dataset contains many columns, so we identify the ones that will be useful for our studies. For instance, the location, area, GDP, taxes, age distribution, number of companies, and number of stores such as Walmart will be helpful, so we select and rename these columns for sales analytics.
We review the columns in the dataset and select those relevant to our research. It is also helpful to rename the columns chosen to indicate their purpose or content more clearly. This method will make it easier to refer to the columns in our analysis and improve the clarity and interpretability of our results.
Generally, it is best to consider factors such as the amount of data available for each column and the potential impact of missing or incomplete data on the results. This method will help ensure that our analysis is as comprehensive and accurate as possible. For instance, the number of Walmart stores may affect a city's buying or selling habits.
To clean the area column, we use the .str.replace() method to strip the thousands-separator commas from its values so they can be cast to float. We also use the .lower() method to convert the city names to lowercase, making them consistent and easier to join against the city names in the other datasets.
Once we have removed the commas and lowercased the city names, we can save the modified dataset to a new file or continue our analysis.
city_data
city_data = city_data[['CITY', 'STATE','LONG','LAT','ALT','AREA','TAXES','GDP','IBGE_RES_POP','IBGE_1','IBGE_1-4','IBGE_5-9','IBGE_10-14','IBGE_15-59','IBGE_60+', 'IDHM','PAY_TV','COMP_TOT','COMP_A','COMP_B','COMP_C','COMP_D','COMP_E','COMP_F','COMP_G','COMP_H','COMP_I','COMP_J','COMP_K','COMP_L','COMP_M','COMP_N','COMP_O','COMP_P','COMP_Q','COMP_R','COMP_S','COMP_T','COMP_U'
#,'WAL-MART'
]]
city_data = city_data.rename(columns={
    'CITY': 'city',
    'STATE': 'state',
    'LONG': 'longitude',
    'LAT': 'latitude',
    'ALT': 'altitude',
    'AREA': 'area',
    'TAXES': 'taxes',
    'GDP': 'gdp',
    'IBGE_RES_POP': 'res_population',
    'IBGE_1': 'pop_below1',
    'IBGE_1-4': 'pop_1_4',
    'IBGE_5-9': 'pop_5_9',
    'IBGE_10-14': 'pop_10_14',
    'IBGE_15-59': 'pop_15_59',
    'IBGE_60+': 'pop_60',
    'IDHM': 'HDI',
    'PAY_TV': 'pay_TV',
    'COMP_TOT': 'total_company',  # the source column is COMP_TOT, not COM_TOT
    'WAL-MART': 'walmart',
})
city_data['area'] = city_data['area'].str.replace(',', '').astype(float)
city_data
city_data['city'] = city_data['city'].str.lower()
city_data
The category data also needs modification: the category names use underscores as word separators, so we use str.replace() to swap them for spaces and make the names cleaner. Once we have cleaned up the names, we can use them for display or continue with our analysis; note that any table we later join on the category name must use the same formatting. It is generally a good idea to keep names straightforward and informative, as this makes our research more efficient and effective.
category_data
category_data['product_category_name'] = category_data['product_category_name'].str.replace('_', ' ')
category_data['product_category_name_english'] = category_data['product_category_name_english'].str.replace('_', ' ')
product_data
We want an overview of our dataset, including the column names and relevant information about each column, so we use the .info() and .describe() methods in a Python program.
The .info() method will provide a summary of the data frame, including the number of rows and columns, the data type for each column, and the number of non-null values. This info can help identify any missing or incomplete data in our dataset and understand the general structure of the data.
The .describe() method, on the other hand, provides more detailed statistics about the numeric columns in our dataset, including the count of non-null values and the mean, standard deviation, minimum, and maximum of each numeric column. This helps us understand the distribution and range of values in our dataset and identify potential outliers or anomalies in the data.
Using these methods, we get a valuable overview of the cleaned dataset that helps us plan and prepare for further analysis.
customer_data.to_csv('customer.csv')
!cp customer.csv "drive/My Drive/"
from google.colab import files
customer_data.to_csv('customer.csv')
files.download('customer.csv')
customer_data.info()
order_data.info()
from google.colab import files
order_data.to_csv('order.csv')
files.download('order.csv')
items_data.info()
from google.colab import files
items_data.to_csv('items.csv')
files.download('items.csv')
product_data.info()
from google.colab import files
product_data.to_csv('product.csv')
files.download('product.csv')
city_data.info()
from google.colab import files
city_data.to_csv('city.csv')
files.download('city.csv')
customer_data.info()
from google.colab import files
customer_data.to_csv('customer.csv')
files.download('customer.csv')
geolocation_data.info()
from google.colab import files
geolocation_data.to_csv('geolocation.csv')
files.download('geolocation.csv')
payment_data.info()
from google.colab import files
payment_data.to_csv('payment.csv')
files.download('payment.csv')
review_data.info()
from google.colab import files
review_data.to_csv('review.csv')
files.download('review.csv')
seller_data.info()
from google.colab import files
seller_data.to_csv('seller.csv')
files.download('seller.csv')
category_data.info()
from google.colab import files
category_data.to_csv('category.csv')
files.download('category.csv')
order_items_data.agg(
{
"price": ["min", "max", "median", "mean", "std", "skew"],
"freight_value": ["min", "max", "median", "mean", "std", "skew"],
}
)
product_data.agg(
{
"product_name_length": ["min", "max", "median", "mean", "std", "skew"],
"product_description_length": ["min", "max", "median", "mean", "std", "skew"],
"product_photos_qty": ["min", "max", "median", "mean", "std", "skew"],
"product_weight_g": ["min", "max", "median", "mean", "std", "skew"],
"product_length_cm": ["min", "max", "median", "mean", "std", "skew"],
"product_height_cm": ["min", "max", "median", "mean", "std", "skew"],
"product_width_cm": ["min", "max", "median", "mean", "std", "skew"],
}
)
The product and product category tables are strongly related to each other. We can use the pandas library's merge() method to combine two datasets in Python: it joins the data from two separate data frames into one by matching the values in one or more common columns.
After merging, the resulting data frame will include all of the columns from both tables and any additional columns created as a result of the merge.
Once we have merged the two datasets, we can use the resulting data frame for further analysis or generate reports or visualizations. This method can provide a complete view of our data and help us understand the relationships between products and product categories.
Finally, we gain an overview of each product's information, such as its ID, dimensions, and category.
product_data
category_data
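For reference, the same join can also be done directly with pandas' merge() method. A minimal sketch on toy versions of the two tables (products_toy and categories_toy), joined on the shared product_category_name column:

```python
import pandas as pd

# Toy versions of the two tables, sharing the category-name column.
products_toy = pd.DataFrame({
    "product_id": ["p1", "p2"],
    "product_category_name": ["beleza_saude", "informatica_acessorios"],
})
categories_toy = pd.DataFrame({
    "product_category_name": ["beleza_saude", "informatica_acessorios"],
    "product_category_name_english": ["health_beauty", "computers_accessories"],
})

# merge() keeps rows where product_category_name agrees in both tables.
merged = products_toy.merge(categories_toy, on="product_category_name", how="inner")
```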
import sqlite3
import pandasql as ps #SQL on Pandas Dataframe
# product_id, product_category_name, product_name_length, product_description_length, product_photos_qty
# product_weight_g, product_length_cm, product_height_cm, product_width_cm, product_category_name_english
query = """
SELECT product_id, product_name_length, product_description_length, product_photos_qty,
product_weight_g, product_length_cm, product_height_cm, product_width_cm,category_data.product_category_name_english
FROM product_data
JOIN category_data
ON product_data.product_category_name = category_data.product_category_name
"""
cleaned_product_data = ps.sqldf(query, locals())
cleaned_product_data
We use pandasql for the API queries; for more detailed API queries, refer to our API documents.
The pandasql library allows us to use SQL-like syntax to query and manipulate data frames in Python. This can be helpful if we are more familiar with SQL or want to use SQL-like queries to work more efficiently and intuitively with our data.
To use pandasql, we first need to import the sqldf function from the library, which we did above. This function runs SQL-like queries on our data frames using a syntax similar to standard SQL.
In addition to SELECT, FROM, and WHERE, pandasql supports other common SQL keywords and operators, such as GROUP BY, HAVING, ORDER BY, and JOIN, which we can use for more complex queries and manipulations on our data frames.
We use the SELECT and WHERE clauses in a SQL query to search for a specific seller by ID. The SELECT clause specifies which columns we want to retrieve from the table, and the WHERE clause determines the conditions a row must meet to be included in the result set. "SELECT *" returns all columns (indicated by the *) for the seller with the specified ID. We can also list specific columns in the SELECT clause to retrieve only particular fields rather than all of them.
Similarly, we use this method to filter sellers by city or state, to search customers by city or state, and to filter orders by ID, customer ID, and status.
Filtering sellers and customers by city and state is a standard way to organize and manage seller and customer information in a database, and it can help in analyzing sales data and identifying location-based trends or patterns. Filtering orders by ID, customer ID, and status is likewise helpful for tracking and managing orders within a business. These methods allow efficient retrieval and organization of information for purposes such as managing inventory, analyzing sales data, and providing customer support.
seller_data
query1 = """
SELECT *
FROM seller_data
WHERE seller_id = '51a04a8a6bdcb23deccc82b0b80742cf'
"""
df1 = ps.sqldf(query1, locals())
df1
query2 = """
SELECT *
FROM seller_data
WHERE seller_city = 'braganca paulista'
"""
df2 = ps.sqldf(query2, locals())
df2
query3 = """
SELECT *
FROM seller_data
WHERE seller_state = 'SP'
"""
df3 = ps.sqldf(query3, locals())
df3
customer_data
query4 = """
SELECT *
FROM customer_data
WHERE customer_unique_id = '84732c5050c01db9b23e19ba39899398'
"""
df4 = ps.sqldf(query4, locals())
df4
query5 = """
SELECT *
FROM customer_data
WHERE customer_city = 'taboao da serra'
"""
df5 = ps.sqldf(query5, locals())
df5
query6 = """
SELECT *
FROM customer_data
WHERE customer_state = 'SP'
"""
df6 = ps.sqldf(query6, locals())
df6
order_data
query7 = """
SELECT *
FROM order_data
WHERE order_id = 'ad21c59c0840e6cb83a9ceb5573f8159'
"""
df7 = ps.sqldf(query7, locals())
df7
query8 = """
SELECT *
FROM order_data
WHERE customer_id = '8ab97904e6daea8866dbdbc4fb7aad2c'
"""
df8 = ps.sqldf(query8, locals())
df8
query9 = """
SELECT *
FROM order_data
WHERE order_status = 'canceled'
"""
df9 = ps.sqldf(query9, locals())
df9
order_data
query10 = """
SELECT *
FROM order_data
WHERE order_purchase_year < 2018
"""
df10 = ps.sqldf(query10, locals())
df10
Finally, to retrieve information related to geolocation, we use a natural join across all of our datasets, gaining geographical insight into the locations of customers, sellers, and products. A natural join automatically matches rows on every column name the tables share, which makes it convenient for combining multiple datasets with common attributes such as location. By joining these datasets, we can retrieve location-related information and gain a more comprehensive understanding of how customers, sellers, and products are distributed across a region. This can be useful for analyzing sales data, identifying trends and patterns, and making more informed decisions about how to allocate resources and manage operations.
geolocation_data
query11 = """
SELECT *
FROM order_data
NATURAL JOIN items_data
NATURAL JOIN payment_data
NATURAL JOIN cleaned_product_data
NATURAL JOIN customer_data
NATURAL JOIN seller_data
NATURAL JOIN review_data
NATURAL JOIN geolocation_data
"""
df11 = ps.sqldf(query11, locals())
df11
EDA gives us insight into another dimension of our datasets. Exploratory data analysis (EDA) is the process of analyzing a dataset to summarize its main characteristics, often with visual methods. By performing an EDA, we can gain insights into the dataset's distribution, patterns, and relationships, and identify potential trends, anomalies, or outliers that may require further investigation.
One way to perform an EDA is to query the dataset for specific periods, such as 2017 and 2018, and compare the results. This can help us understand how the number of orders changed over time and identify any trends or patterns. For example, a significant increase in orders in 2018 compared to 2017 would indicate potential growth in the business or market.
In addition to querying for specific periods, we can also use visual methods to explore the data, such as creating graphs, plots, or charts that visualize the data and help us better understand its characteristics.
Overall, an EDA can provide valuable insights into a dataset and help us identify trends, patterns, and relationships that may not be immediately apparent. Using a combination of querying and visual methods, we can gain a deeper understanding of our data and use it to inform analysis and decision-making.
cleaned_product_data
items_data
order_data
number_of_orders = order_data[(order_data['order_purchase_year'] == 2017)|\
                              (order_data['order_purchase_year'] == 2018)]['order_id'].count()
number_of_orders
no_orders_2017 = order_data[order_data['order_purchase_year'] == 2017]['order_id'].count()
no_orders_2018 = order_data[order_data['order_purchase_year'] == 2018]['order_id'].count()
increment = round((no_orders_2018 - no_orders_2017) / no_orders_2017 * 100, 2)
increment
In addition to exploring changes in the number of orders, an EDA can also be used to examine changes in sales, the total amount of money generated by the orders. This can help us understand the growth or performance of the business over time.
To explore changes in sales between 2017 and 2018, we can query the dataset to retrieve the total sales for each year. We can then compare the results to determine the percentage increase or decrease in sales between the two years.
Once we have calculated the percentage change in sales, we can use visual methods to explore the data further, for example by creating a graph or chart showing the difference in sales over time, or by comparing the sales data for different years with a bar chart or line graph. This can help us identify trends or patterns in the data and gain insights into the growth or performance of the business.
Overall, an EDA can be a powerful tool for exploring changes in sales over time and can provide valuable insights into the performance and growth of a business.
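As an illustration of the visual side, a bar chart of yearly sales totals can be sketched with matplotlib. The figures below are made up; in practice they would come from the yearly payment_value sums computed from the data.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical yearly sales totals, for illustration only
sales = pd.Series({2017: 100.0, 2018: 150.0})

# Bar chart of total sales per year
ax = sales.plot(kind='bar')
ax.set_xlabel('year')
ax.set_ylabel('total payment_value')
ax.set_title('Total sales by year')
plt.tight_layout()
plt.savefig('sales_by_year.png')
```

A line graph (kind='line') works the same way and can be more readable when comparing many periods.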
payment_data
payment_order = payment_data.merge(order_data, how='outer', on='order_id')
total_sales = payment_order.loc[(payment_order['order_purchase_year'] == 2017) |\
(payment_order['order_purchase_year'] == 2018), 'payment_value'].sum().round(2)
total_sales
sales_2017 = payment_order.loc[(payment_order['order_purchase_year'] == 2017), 'payment_value'].sum().round(2)
sales_2018 = payment_order.loc[(payment_order['order_purchase_year'] == 2018), 'payment_value'].sum().round(2)
increment = ((sales_2018 - sales_2017) / sales_2017 * 100).round(2)
increment
In addition to exploring changes in the number of orders and sales, an EDA can also be used to examine changes in the number of customers over time. This can help us understand the business's growth or performance and identify potential trends or patterns in customer behavior.
To explore changes in the number of customers between 2017 and 2018, we can query the dataset to retrieve the total number of customers for each year. We can then compare the results to determine the percentage increase or decrease in the number of customers between the two years.
In addition to examining changes in the number of customers over time, we can also use an EDA to explore the geographic distribution of customers. This can help identify potential trends or patterns in customer behavior and reveal the potential market for the business. To do this, we can query the dataset to retrieve the state or country of each customer, which helps us understand the geographic reach of the business and identify areas that may be potential growth markets.
Overall, an EDA can provide valuable insights into changes in the number of customers over time and the geographic distribution of customers.
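As a sketch of the geographic angle (with a made-up customer_data frame), value_counts gives the number of customers per state directly:

```python
import pandas as pd

# Hypothetical customer_data frame, for illustration only
customer_data = pd.DataFrame({
    'customer_unique_id': ['c1', 'c2', 'c3', 'c4'],
    'customer_state': ['SP', 'SP', 'RJ', 'MG'],
})

# Customers per state, most common state first
customers_per_state = customer_data['customer_state'].value_counts()
print(customers_per_state)
```

The resulting Series is already sorted by count, so the head of the output identifies the states where most customers are concentrated.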
customer_data
customer_order = customer_data.merge(order_data, how='left', on='customer_id')
number_of_customers = customer_order[(customer_order['order_purchase_year'] == 2017)|\
                                     (customer_order['order_purchase_year'] == 2018)]['customer_unique_id'].nunique()
number_of_customers
no_cust_2017 = customer_order[customer_order['order_purchase_year'] == 2017]['customer_unique_id'].nunique()
no_cust_2018 = customer_order[customer_order['order_purchase_year'] == 2018]['customer_unique_id'].nunique()
increment = round((no_cust_2018 - no_cust_2017) / no_cust_2017 * 100, 2)
increment
states = customer_data['customer_state'].nunique()
states
An EDA can also be used to explore product review scores and to identify trends or patterns in customer feedback. This can help us understand the quality and performance of the products and identify potential areas for improvement.
To explore review scores, we can query the dataset to retrieve the score for each product and then calculate the percentage of reviews with a score of five, the highest rating. This tells us about customers' overall satisfaction and highlights particularly popular or well-received products.
In addition to exploring the review scores of individual products, we can use an EDA to identify related products and gain insight into potential market trends. This is useful for understanding which products customers are interested in and for identifying potential areas for growth or expansion. To do this, we can query the dataset to retrieve the most frequently purchased products.
Overall, an EDA can provide valuable insights into the review scores and related products in a dataset and help us understand the performance and quality of the products, as well as potential market trends.
review_data
number_of_5_review = review_data[review_data['review_score'] == 5]['review_score'].count()
number_of_5_review
percentage = round(number_of_5_review / review_data['review_score'].count() * 100, 2)
percentage
In addition to exploring review scores and related products, an EDA can also be used to examine the sales of different product categories. This can help us understand the popularity and performance of different types of products and identify potential trends or patterns in customer behavior.
To explore the sales of different product categories, we can group the sales data by category. This allows us to calculate total sales per category and compare the results to determine which categories are the most popular.
We can also use an EDA to examine sales trends over time, which helps us understand the business's growth or performance and identify potential opportunities for expansion or improvement. To do this, we can query the dataset for sales data over different periods, such as monthly or quarterly sales.
Overall, an EDA can provide valuable insights into the sales of different product categories and help us understand the popularity and performance of different types of products.
sales_of_product = payment_order.merge(items_data, on='order_id').merge(cleaned_product_data, on='product_id')\
[['order_id', 'order_purchase_year', 'product_category_name_english', 'payment_value']]\
.groupby(['product_category_name_english', 'order_purchase_year']).sum()
sales_of_product
cleaned_product_data.groupby('product_category_name_english')['product_category_name_english'].count().sort_values(ascending = False).head(10)
payment_order.sort_values(by='order_purchase_timestamp')[['order_id', 'order_purchase_timestamp', 'payment_value']]
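Order-level payments like those above can be rolled up into period totals. A minimal sketch (with made-up payments) that groups payment_value by purchase month:

```python
import pandas as pd

# Hypothetical order payments with purchase timestamps, for illustration only
payment_order = pd.DataFrame({
    'order_id': ['o1', 'o2', 'o3', 'o4'],
    'order_purchase_timestamp': pd.to_datetime(
        ['2017-01-15', '2017-02-03', '2017-02-20', '2017-03-08']),
    'payment_value': [50.0, 30.0, 20.0, 10.0],
})

# Convert each timestamp to a monthly period and sum payments per month
monthly_sales = payment_order.groupby(
    payment_order['order_purchase_timestamp'].dt.to_period('M')
)['payment_value'].sum()
print(monthly_sales)
```

Using dt.to_period('Q') instead gives quarterly totals; the resulting Series plots directly as a sales-over-time chart.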
In addition to exploring the sales of different product categories, an EDA can also be used to examine attributes of different cities, such as GDP, residential population, and HDI. This can help us understand the economic conditions and purchasing potential of different cities and identify potential trends or patterns in city data.
To explore the attributes of different cities, we can query the dataset to retrieve the relevant data, such as GDP, residential population, and HDI. We can then use sorting or grouping methods to organize the data and identify which cities have the highest values for the selected attributes.
Overall, an EDA can provide valuable insights into the attributes of different cities and help us understand the economic conditions and purchasing potential of different regions.
city_data[['city', 'state', 'gdp']].sort_values('gdp', ascending=False).head(10)
city_data[['gdp', 'res_population', 'HDI', 'COMP_TOT', 'pay_TV']].describe()
Overall, the Brazilian E-commerce and the Brazilian Cities datasets are rich with information and provide a wealth of opportunities for exploration and analysis. Although some cleaning and column renaming may be necessary, the datasets are generally well-structured. They contain a wide range of data that can be used to gain insights into the e-commerce market in Brazil and the economic conditions and purchasing potential of different cities.
Pandas and pandasql are essential for the data cleaning process. By using exploratory data analysis (EDA) techniques, we can better understand the data and identify potential trends, patterns, and relationships within the datasets.
In addition to performing an EDA, we will use the data to create visualizations and web applications or interactive tools that allow users to explore the data and gain insights on their own. Presenting the data in web applications will make it more accessible and understandable, helping others use it to inform their own analysis and decision-making.